Integrating local and global information to identify influential nodes in complex networks

Centrality analysis is a crucial tool for understanding the role of nodes in a network, but it remains unclear how much unique information different centrality measures provide. To improve the identification of influential nodes in a network, we propose a new method called Hybrid-GSM (H-GSM) that combines the K-shell decomposition approach and Degree Centrality. H-GSM characterizes the impact of nodes more precisely than the Global Structure Model (GSM), which cannot distinguish the importance of each node. We evaluate the performance of H-GSM using the SIR model to simulate the propagation process on six real-world networks. Our method outperforms other approaches in computational complexity, node discrimination, and accuracy. Our findings demonstrate that the proposed H-GSM is an effective method for identifying influential nodes in complex networks.

Complex networks are intricate systems composed of interconnected elements, represented as nodes and edges, in which the interactions between these elements exhibit non-trivial patterns 1 . To analyze complex networks comprehensively, researchers often explore them from four essential perspectives: path analysis, connectivity, community, and centrality. These analytical approaches shed light on the pathways, structural connections, cohesive groups, and influential nodes within the network, enabling a holistic understanding of its dynamics and characteristics 2 . One of the most essential and challenging problems in network science is determining and prioritizing the most important nodes. The process of discovering and ranking the most influential nodes (INs) is critical for gaining a thorough view of a network's structure and operation 3 . Over the years, several centrality metrics have been proposed to rank nodes by their degree and importance in the network's structure [4][5][6][7] . The efficiency of a centrality measure in finding key nodes is considered to depend on its topological significance 8 . Currently, a website (http://www.centiserver.org) 9 documents approximately 403 centrality indices, providing a comprehensive resource for network analysis. However, despite this vast compilation, identifying the most influential nodes within complex networks remains an ongoing pursuit.
In recent years, methods for locating prominent nodes have become more targeted, relying only on either global or local information. For example, K-shell decomposition (Ks) 10,11 and Degree Centrality (DC) 12 are two of the most thoroughly explored measures of global and local information, respectively. Because of their simplicity, these two techniques have achieved broad use in networks of all sizes. However, Ks and DC have limitations in determining the relative significance of nodes in a network.
For Ks, first, prior knowledge about the value of k is required, which may not be easily accessible 13 . Second, since Ks is based on local connections 14 , it may not be useful in recognising the hierarchical structure of networks. Third, it may not correctly represent the underlying structural aspects of the network, since it may not capture the relevance of nodes that act as bridges between different levels. Finally, it may be sensitive to the particular technique used to calculate it, making it less reliable for comparing networks [15][16][17] .
On the other hand, DC is a standard network analysis metric that rates a node's relevance based on the number of edges (links) it has in the network. One issue is that it does not consider how strong or important those connections are [18][19][20] . Nodes with a high degree centrality may have numerous connections, but those connections may be with nodes that are not central or significant in their own right 21 . Moreover, degree centrality does not take into account a node's structural location in the network 22 , which might influence its relevance. Lastly, degree centrality is less effective in directed networks, where incoming and outgoing edges require a distinction between in-degree and out-degree centrality 20,22,23 . Overall, degree centrality is a valuable metric for detecting INs; however, it should be used in conjunction with other measures that capture other characteristics of a node's significance in a network 14,21,22 .
Other common centrality methods, such as betweenness centrality (BC) 24 and closeness centrality (CC) 25 , estimate node impact based on global network information and may give better ranking results, but they have a significant computational cost 26,27 . A node's location in the network, both local and global, may have a substantial influence on the power of INs 17,18,[28][29][30][31] . Researchers have therefore focused on local and global network information to solve the problem of identifying INs 17,18,[28][29][30] . Unfortunately, traditional identification approaches frequently missed critical information because they failed to account for global and local network information simultaneously 7,31 . As a result, the outcomes are often skewed. Information about neighbouring nodes does increase the accuracy and correctness of a method 32 . Researchers found that combining local and global information improved the identification of influential nodes. This integration enhanced detection at both the local (within a community or cluster) and global levels 7,[33][34][35] . Taking the coreness and the shortest distance between nodes into account might therefore improve the discovery of INs.
Recently, an innovative approach, the Global Structure Model (GSM) 29 , and its improved version, IGSM 18 , were introduced to identify INs in these networks. These approaches apply local and global information, namely DC and Ks, respectively. Yet one key weakness of both methods is their inability to quantify the significance of individual nodes, which leaves a large gap in our knowledge of complex networks. As the need for more accurate and extensive network research grows, new approaches that overcome this limitation and give deeper insights into the structure of complex networks must be developed.
The primary contributions of this paper lie in the development and application of the Hybrid Global Structure Model (H-GSM). The H-GSM algorithm addresses the deficiencies of current techniques by considering both local and global information of each node, resulting in a more comprehensive understanding of the overall structure of complex networks. Specifically, our contributions are as follows: Overall, the H-GSM algorithm contributes to the advancement of network analysis by offering a novel approach that combines local and global influences, outperforming existing centrality measures, and providing superior scalability for large-scale network analysis.
The rest of this paper is organised as follows. The section titled "Method" briefly introduces several baseline approaches and explains the proposed H-GSM method in detail. Next, six real-world networks are used to validate the proposed method; they are described in the section titled "Datasets and evaluation criteria", and the outcomes are reported in the section titled "Results and discussions". The findings of this study and recommendations for future work are presented in the final section, titled "Conclusion and future recommendations".

Method
Background analysis. Suppose a network is denoted as $G = (V, E)$, where V is the set of nodes and E is the set of edges. If there is an edge between node i and node j, then $a_{ij} = 1$ and the two nodes are directly connected; if there is no edge, then $a_{ij} = 0$ and they are not directly connected. The total number of nodes in the network is denoted as n. The indices used in this study are introduced in this section.
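As a small illustration of this notation, the following Python sketch builds the adjacency matrix from an edge list. The function name `adjacency_matrix` and the four-node example graph are hypothetical and serve only to make the definition of $a_{ij}$ concrete.

```python
def adjacency_matrix(n, edges):
    """Return the n x n adjacency matrix of an undirected, unweighted graph."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1  # a_ij = 1: nodes i and j are directly connected
        a[j][i] = 1  # undirected graph, so the matrix is symmetric
    return a


# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
a = adjacency_matrix(4, edges)

# Each row sum counts the direct neighbours of the corresponding node.
row_sums = [sum(row) for row in a]
print(row_sums)
```

Entries left at 0 mark node pairs that are not directly connected, such as nodes 0 and 3 in this example.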
Degree centrality (DC). DC, the most basic form of centrality, is the number of nodes directly linked to a node. DC reflects node information at the most local level, which is straightforward and intuitive. The higher the degree, the greater the influence of the node. A node's degree centrality is given by
$$DC(i) = \sum_{j=1}^{n} a_{ij}$$
Betweenness centrality (BC). The BC of a node is the fraction of the shortest paths between other pairs of nodes that pass through the node. BC identifies INs based on global information. A node with a high BC value plays an important role in linking different parts of the network. BC is defined as
$$BC(i) = \sum_{j \neq i \neq k} \frac{g_{jk}(i)}{g_{jk}}$$
where $g_{jk}$ is the number of shortest paths between nodes j and k, and $g_{jk}(i)$ is the number of those shortest paths that pass through node i.
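The two definitions above can be illustrated with a short Python sketch. It is a minimal, unoptimised version for small undirected graphs: the function names and the example graph are hypothetical, each unordered pair (j, k) is counted once, and the all-pairs loop costs roughly cubic time, so it is not how BC would be computed on the large networks studied here.

```python
from collections import deque


def bfs_counts(adj, source):
    """Return (dist, sigma): shortest distances and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:  # first visit fixes the shortest distance
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:  # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma


def degree_centrality(adj):
    """DC(i): number of direct neighbours of node i."""
    return {i: len(adj[i]) for i in adj}


def betweenness_centrality(adj):
    """BC(i): sum over pairs (j, k) of g_jk(i) / g_jk, each unordered pair once."""
    info = {s: bfs_counts(adj, s) for s in adj}
    nodes = sorted(adj)
    bc = {i: 0.0 for i in nodes}
    for pos, j in enumerate(nodes):
        dist_j, sigma_j = info[j]
        for k in nodes[pos + 1:]:
            if k not in dist_j:
                continue  # j and k are not connected
            for i in nodes:
                if i == j or i == k or i not in dist_j:
                    continue
                dist_i, sigma_i = info[i]
                # i lies on a shortest j-k path iff the distances add up.
                if dist_j[i] + dist_i[k] == dist_j[k]:
                    bc[i] += sigma_j[i] * sigma_i[k] / sigma_j[k]
    return bc


# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
dc = degree_centrality(adj)
bc = betweenness_centrality(adj)
print(dc)
print(bc)
```

In this example only node 2 lies between other nodes (it is the sole route from nodes 0 and 1 to node 3), so it is the only node with a non-zero BC.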
Closeness centrality (CC). CC also identifies prominent nodes based on global information. CC indicates a node's proximity to all other nodes in the network. It uses the shortest distance ( $d_{ij}$ ) between each pair of nodes to quantify the influence of each node. The CC of a node is the reciprocal of the sum of its shortest distances to all other nodes, up to a normalisation factor.
K-shell decomposition (Ks) method. Ks is one of the global centrality approaches for determining the core position of nodes in a network. Ks assigns an index to each node by repeatedly removing nodes according to their degree. Nodes with one connection are removed first, and the degrees of the remaining nodes are recalculated. Stripping then continues with nodes of higher degree until no more nodes can be removed. A node with a higher Ks value lies closer to the core of the network and is considered more significant. Ks assigns the same value to every node in a shell 11 , so it cannot distinguish which of those nodes have greater influence.
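The stripping procedure can be sketched in Python as follows. This is a minimal illustration of the standard k-shell pruning for a simple undirected graph, not an optimised implementation; the function name `k_shell` and the example graph are hypothetical, and isolated nodes are assigned Ks = 0 by convention.

```python
def k_shell(adj):
    """Return {node: Ks} by repeatedly stripping nodes of degree at most k."""
    degree = {u: len(adj[u]) for u in adj}
    remaining = set(adj)
    ks = {}
    k = 0
    while remaining:
        # Strip every node whose current degree is at most k, then recompute
        # degrees and repeat until no such node is left in the network.
        to_strip = [u for u in remaining if degree[u] <= k]
        while to_strip:
            for u in to_strip:
                ks[u] = k
                remaining.discard(u)
            for u in to_strip:
                for v in adj[u]:
                    if v in remaining:
                        degree[v] -= 1
            to_strip = [u for u in remaining if degree[u] <= k]
        k += 1  # move on to the next shell
    return ks


# Hypothetical example: a complete graph on nodes 0-3, node 4 linked to
# nodes 0 and 1, and a pendant node 5 linked to node 4.
adj = {
    0: [1, 2, 3, 4],
    1: [0, 2, 3, 4],
    2: [0, 1, 3],
    3: [0, 1, 2],
    4: [0, 1, 5],
    5: [4],
}
shells = k_shell(adj)
print(shells)
```

Nodes 0 to 3 end up in the same shell even though nodes 0 and 1 have more neighbours than nodes 2 and 3, which illustrates why Ks alone cannot separate nodes within a shell.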
Global structure model (GSM). GSM 29 describes the influence of a node through two components, its self-influence and its global influence.
Proposed method. The approach proposed in this study outperforms the existing GSM and IGSM methods in identifying INs in a network. The algorithm employs two indices, DC and Ks, and also takes node position information into account. Node position matters for how data are distributed, because nodes in crucial positions may have a stronger effect on the flow of information or resources within the network. By combining these measures with position information, the proposed technique offers a more complete approach to identifying INs in a network.
To refine the notion of node influence, we adopted GSM's concepts of self-influence and global influence but applied them differently. By augmenting a node's DC with its Ks value, we define the enhanced self-influence (iSI). The iSI factor is then used to calculate the enhanced global influence (iGI), which takes into account all nodes that are directly or indirectly connected to a node. The iGI factor sums, over these nodes, ratios that relate their Ks and DC values to the shortest path length. The shortest path length is calculated with the average iSI value and is also referred to as information loss. In assessing node influence, the proposed H-GSM method takes both iSI and iGI into account.
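To make the idea of combining a self-influence factor with a global-influence factor concrete, the following Python sketch scores nodes on a toy graph. The exact H-GSM equations are not reproduced in this text, so the specific forms used here are assumptions made purely for illustration and should not be read as the method's definition: the self factor is taken as exp((DC(i) + Ks(i)) / n), the global factor as the sum of (DC(j) + Ks(j)) / d_ij over all reachable nodes j, and the score as their product. All function names are hypothetical.

```python
import math
from collections import deque


def k_shell(adj):
    """Return {node: Ks} by repeatedly stripping nodes of degree at most k."""
    degree = {u: len(adj[u]) for u in adj}
    remaining, ks, k = set(adj), {}, 0
    while remaining:
        to_strip = [u for u in remaining if degree[u] <= k]
        while to_strip:
            for u in to_strip:
                ks[u] = k
                remaining.discard(u)
            for u in to_strip:
                for v in adj[u]:
                    if v in remaining:
                        degree[v] -= 1
            to_strip = [u for u in remaining if degree[u] <= k]
        k += 1
    return ks


def bfs_distances(adj, source):
    """Shortest path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def hybrid_score(adj):
    """Illustrative score = self factor * global factor (assumed forms)."""
    n = len(adj)
    ks = k_shell(adj)
    # ASSUMPTION: DC and Ks are combined by simple addition.
    weight = {u: len(adj[u]) + ks[u] for u in adj}
    scores = {}
    for i in adj:
        self_factor = math.exp(weight[i] / n)  # ASSUMPTION: exponential form
        dist = bfs_distances(adj, i)
        # ASSUMPTION: each other node contributes its weight divided by distance.
        global_factor = sum(weight[j] / d for j, d in dist.items() if j != i)
        scores[i] = self_factor * global_factor
    return scores


# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
scores = hybrid_score(adj)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Under these assumed forms, node 2 receives the highest score because it has both the largest degree and the shortest distances to all other nodes, while the pendant node 3 receives the lowest.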
The proposed approach is significant because it quantifies more accurately how nodes in a network influence one another. Our strategy considers a node's local and global influence, as well as how it affects other nodes and the network as a whole. In addition, including node position information aids in understanding how data spread throughout the network. The new technique is expected to outperform current methods in discriminating between nodes, making it an important contribution to the area of network analysis. The node influence is the product of the two factors:
$$\text{H-GSM}(i) = iSI(i) \times iGI(i)$$
Computation process. To clarify the calculation procedure of the H-GSM algorithm, Figure 1 depicts a simple network with 7 nodes and 10 edges, partitioned by k-shell. We use node 3 as the target node, hence i = 3. Node 3 lies in the third shell (Ks = 3), as do nodes 0, 1, and 2. Node 3 has five edges attached to it, so its DC value is 5. We begin by computing the Ks, DC, and shortest distance between each pair of nodes.
Step 1: Determine the Ks and DC values.
Step 4: Calculate the node influence of H-GSM.
For the network in Fig. 1, Table 1 presents the node rankings obtained with the DC, BC, CC, Ks, GSM, IGSM, and H-GSM methods. Earlier measures such as DC, Ks, BC, and CC could only distinguish between six levels. Nonetheless, DC is better at level discrimination than Ks and BC. The node ranking for CC and GSM is the same. In finding the most influential nodes in the network, both H-GSM and IGSM outperform standard GSM. For example, GSM places nodes 1 and 2 at the same rank (rank 2) and therefore cannot tell which of them is more significant. When it comes to distinguishing nodes, however, H-GSM surpasses IGSM. For example, the H-GSM values of nodes 2 and 1 differ clearly, so the distinction between them is obvious.
Datasets and evaluation criteria
Datasets. In this article, we experiment with several unweighted and undirected graphs of varying scales. We examine algorithm performance in terms of running time and influence spread and compare it to that of other algorithms. To compare the performance of the proposed H-GSM with the other indexing methods, we apply the Susceptible-Infected-Recovered (SIR) epidemic model as a benchmark simulator over six real networks: USAir97 36 , Netscience and its largest component subgraph (Netscience1) 37 , Email 38 , Yeast 39 , and Router 40 . Table 2 lists some elementary statistics of these networks, including the total number of nodes (n), the total number of edges (m), the maximum and minimum degree ( $d_{max}$ and $d_{min}$ ), and the maximum core value ( $core_{max}$ ).
SIR model. SIR is a model that divides the population into three categories: susceptible (S), infected (I), and recovered (R). In each run, a single node is chosen as the infected seed, while all other nodes are set as susceptible. The seed node infects its neighbors with spreading probability α, and infected nodes recover with probability β. Each loop is viewed as a time step t, and F(t) gives the number of nodes infected at time t, which is used to evaluate the influence of the initially infected node 41 . The spreading process ends when no infected nodes remain. The same procedure is performed for each node in each network, with 500 iterations.
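The spreading process described above can be sketched in Python as follows. This is a minimal discrete-time simulation under stated assumptions: every infected node tries to infect each susceptible neighbour with probability α at each step and then recovers with probability β, and the spreading ability of a seed is summarised as the average number of nodes that were ever infected. The function names, the summary statistic, and the example graph are hypothetical choices for illustration.

```python
import random


def sir_spread(adj, seed, alpha, beta, rng):
    """Run one SIR simulation from a single seed node.

    Returns (history, outbreak): history[t] is F(t), the number of infected
    nodes at step t, and outbreak is the number of nodes ever infected.
    """
    if beta <= 0:
        raise ValueError("beta must be positive, otherwise the process never ends")
    infected = {seed}
    recovered = set()
    history = [len(infected)]
    while infected:  # stop when no infected node remains
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                susceptible = v not in infected and v not in recovered
                if susceptible and rng.random() < alpha:
                    newly_infected.add(v)
        newly_recovered = {u for u in infected if rng.random() < beta}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
        history.append(len(infected))
    return history, len(recovered)


def spreading_ability(adj, seed, alpha, beta, runs=500, rng_seed=0):
    """Average outbreak size over repeated runs from the same seed node."""
    rng = random.Random(rng_seed)
    total = 0
    for _ in range(runs):
        _, outbreak = sir_spread(adj, seed, alpha, beta, rng)
        total += outbreak
    return total / runs


# Hypothetical example: a path 0-1-2-3 with certain infection and recovery.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
history, outbreak = sir_spread(path, 0, 1.0, 1.0, random.Random(1))
print(history, outbreak)
```

Ranking all nodes by `spreading_ability` gives a simulated reference ordering against which the orderings produced by the centrality measures can be compared.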
Kendall coefficient. Kendall's coefficient τ is used to determine how well the rankings produced by the centrality measures match the rankings obtained from SIR simulations 42 . It measures the similarity and consistency of two sequences. The more strongly the list produced by a ranking method agrees with the list of nodes ordered by their spreading ability in the SIR model, the more successful the ranking method is, since more influential nodes have a larger capacity to propagate. Assume a network comprises n vertices, with $n_c$ and $n_d$ representing the number of concordant and discordant pairs, respectively. Kendall's coefficient is calculated as
$$\tau = \frac{n_c - n_d}{0.5\,n(n-1)}$$
The greater the value of τ, the more accurate the ranked list generated by the ranking method. In the ideal case, τ = 1, the method and the actual spreading process produce identical ranking lists. With a large α value, the spreading would cover nearly the whole network whichever node is chosen as the seed. In this experiment, the SIR model's spreading probability α is increased gradually from 0.01 to 0.1.
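As an illustration, Kendall's coefficient can be computed directly from the counts of concordant and discordant pairs. The sketch below assumes the common form τ = (n_c − n_d) / (0.5 n (n − 1)), in which tied pairs count as neither concordant nor discordant; the function name and the sample score lists are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau between two equally long score lists (ties count as neither)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1  # the pair is ordered the same way in both lists
            elif sign < 0:
                discordant += 1  # the pair is ordered in opposite ways
    return (concordant - discordant) / (0.5 * n * (n - 1))


# Hypothetical scores for four nodes: a centrality measure and a simulated spread.
centrality = [0.9, 0.7, 0.4, 0.1]
spread = [120.0, 95.0, 60.0, 12.0]
tau = kendall_tau(centrality, spread)
print(tau)
```

Here both lists order the four nodes identically, so every pair is concordant and τ reaches its maximum value of 1.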

Results and discussions
Computational complexity. Before discussing how well our method performs, we assess its computational cost. Computational complexity, also known as algorithm complexity, is the amount of time or space an algorithm requires for a given input size. As described above, H-GSM consists of three major components. The first step determines the DC and Ks values and then evaluates the iSI formula, with O(time) complexity. The second step applies Dijkstra's shortest-path algorithm, with complexity $O(n^2)$, to determine the global influence (iGI) from the DC and Ks values. The third step identifies INs by multiplying iSI and iGI. Because our method is an improvement of GSM, we take the overall time complexity of H-GSM to be $O(n^2)$ as well.
We measure the computational cost of our technique by benchmarking its execution on the six networks. In terms of execution time, our approach surpasses DC, BC, CC, GSM, and IGSM, as shown in Table 3. Although DC is commonly regarded as the simplest measure and the easiest to calculate, H-GSM runs faster than DC in these experiments. The methods in this work were run on a 64-bit Windows 11 system with an Intel® Core i7-8550U CPU @ 2.4 GHz and 24 GB of RAM, and were programmed in Python using Visual Studio Code 1.56.2.

Nodes spreading and discrimination comparison.
We use the SIR model, which is extensively used in network epidemic dynamics, to quantify node influence by examining how the spread differs across nodes. SIR reveals which nodes spread faster and wider over time. In general, the importance of a node is proportional to its spreading capacity, so the node with the greatest spreading capacity is the most influential. Table 4 lists the top 10 nodes with the greatest spreading capability in the six networks according to the DC, BC, CC, GSM, IGSM, and H-GSM methods. As illustrated in Fig. 2, the ten nodes of each method were then used as seed nodes to monitor the convergence of their propagation capacities. The number of infected nodes, F(t), increases with time and quickly stabilises. H-GSM propagates highly effectively in the majority of the networks.
After that, we analyse whether the presence of a node in the H-GSM list affects propagation. To analyse the convergence of H-GSM, a node from the H-GSM list that is absent from the GSM list is compared with a node from the GSM list, as shown in Table 4. The findings are as follows.
USAir97. All methods place the same nodes at the top of their rankings. For GSM, IGSM, and H-GSM, the top 10 nodes are identical except for their order. When comparing H-GSM with GSM, the top five nodes are the same, but the sixth differs. As shown in Figure 3a, node 90 (H-GSM) and node 58 (GSM) behave similarly at the beginning; however, node 90 shows a surge from t = 2 to t = 5. Over time, node 90 can therefore infect more nodes than node 58.
Netscience1. The lists produced by the various approaches differ in the rank of each node. The lists for DC, IGSM, and H-GSM are largely identical, the main difference being the order in which the nodes occur. When node 58 (H-GSM) is compared to node 14 (GSM), as shown in Figure 3b, node 58 performs better since it spreads faster and further over time.
Email. In this network, the top five nodes of GSM, IGSM, and H-GSM have similar ranks. The same five nodes appear in all three of the DC, BC, and CC lists. As illustrated in Figure 3c, node 15 (H-GSM) spreads more quickly than node 298 (GSM), although their long-term behaviour is very similar.

Conclusion and future recommendations
When the H-GSM findings are compared to those of GSM and other techniques, it is evident that H-GSM is superior. H-GSM is a hybrid strategy that improves on the GSM methodology by taking into account both the network's local and global structure, as represented by the DC and Ks approaches, respectively. The capacity of a node to exchange information with all other nodes, as well as the node's own impact, are used as indicators to measure the node's influence in the network. The fundamental assumption is that the position and degree of connectedness of nodes are the major drivers of their influence. The more common nodes there are between two nodes, the closer they are, suggesting a better capacity to transfer information. The more significant a neighbour node is, the more it contributes to a node's influence.
To analyse the usefulness of the proposed strategy, we use the SIR model to simulate the propagation process and perform tests on the two major aspects of discrimination and accuracy in diverse real-world networks. First, when compared to other techniques, H-GSM has the lowest average computational cost in terms of execution time. Second, the comparison of the top ten most influential nodes illustrates that integrating local and global information can successfully discriminate between the importance of nodes. Moreover, the proposed technique outperforms the others in ranking correlation, demonstrating its high accuracy. Finally, by integrating the strengths of the DC and Ks indices and incorporating additional self-influence and global-influence metrics, the proposed H-GSM algorithm improves on previous techniques. The addition of location and node data improves its capacity to recognise important nodes in a network. We think that our proposed technique will make a substantial contribution to the area of network analysis and will be beneficial in a variety of applications.
In conclusion, this paper has presented H-GSM as a novel approach for analyzing complex networks. By considering both the local and global information of each node, the H-GSM algorithm addresses the deficiencies of current techniques and offers a more comprehensive understanding of network structures. The H-GSM algorithm represents a significant step forward in network analysis, enabling researchers to gain deeper insights into complex network structures and to identify INs with improved accuracy and scalability. This is one of our early efforts that use the network's topological connection structure. We will continue our study by evaluating other combinations and validation approaches in order to increase the performance of the proposed methods.